Add MiniMax-M2.5 FP8 vLLM benchmark for B200 #757

Open
functionstackx wants to merge 19 commits into main from claude/issue-756-20260219-0235

Conversation

@functionstackx
Contributor

@functionstackx functionstackx commented Feb 19, 2026

Add MiniMax-M2.5 FP8 vLLM benchmark configuration for B200 GPUs.

  • New benchmark script benchmarks/single_node/minimaxm2.5_fp8_b200.sh
  • Config entry minimaxm2.5-fp8-b200-vllm in nvidia-master.yaml
  • Image: vllm/vllm-openai:v0.16.0-cu130
  • TP=2 and TP=4, concurrency 4-64
  • Sequence lengths: 1k1k, 1k8k, 8k1k

Closes #756

Generated with Claude Code

github-actions bot and others added 2 commits February 19, 2026 13:46
- Add benchmark script benchmarks/minimaxm2.5_fp8_b200.sh
- Add config entry minimaxm2.5-fp8-b200-vllm to nvidia-master.yaml
  - Image: vllm/vllm-openai:v0.15.1-cu130
  - Model: MiniMaxAI/MiniMax-M2.5
  - TP=2 and TP=4, concurrency 4-64
  - Sequence lengths: 1k1k, 1k8k, 8k1k
- Update perf-changelog.yaml

Closes #756

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
@functionstackx functionstackx force-pushed the claude/issue-756-20260219-0235 branch from 34f7756 to cbc5d0b Compare February 19, 2026 18:46
@kedarpotdar-nv kedarpotdar-nv self-assigned this Feb 19, 2026
@kedarpotdar-nv
Collaborator

@claude please review. i made a few changes

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 20, 2026

Claude finished @kedarpotdar-nv's task in 2m 40s


Review of PR #757

  • Gather context and read changed files
  • Review benchmark script
  • Review nvidia-master.yaml config
  • Review perf-changelog.yaml
  • Post review feedback

Bug: Wrong benchmark_lib.sh source path

benchmarks/single_node/minimaxm2.5_fp8_b200.sh:3 sources benchmark_lib.sh with:

source "$(dirname "$0")/benchmark_lib.sh"

This resolves to benchmarks/single_node/benchmark_lib.sh, which does not exist. The file is at benchmarks/benchmark_lib.sh. Every other script in single_node/ uses /../:

source "$(dirname "$0")/../benchmark_lib.sh"

This will cause the script to fail immediately at runtime.
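The difference is easy to demonstrate in isolation. A sketch using a throwaway directory layout that mirrors the repo's `benchmarks/` tree (paths and the "lib loaded" marker are illustrative only):

```shell
#!/bin/bash
# Recreate the layout: lib at benchmarks/, script one level down in single_node/
demo=$(mktemp -d)
mkdir -p "$demo/benchmarks/single_node"
echo 'echo "lib loaded"' > "$demo/benchmarks/benchmark_lib.sh"

cat > "$demo/benchmarks/single_node/run.sh" <<'EOF'
#!/bin/bash
# $(dirname "$0") is the directory containing this script (single_node/),
# so reaching benchmarks/benchmark_lib.sh requires the /.. hop.
source "$(dirname "$0")/../benchmark_lib.sh"
EOF

bash "$demo/benchmarks/single_node/run.sh"
```

Without the `/..`, `source` would look for `single_node/benchmark_lib.sh` and the script would die at line 3 before serving anything.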



Benchmark Script (minimaxm2.5_fp8_b200.sh)

Good changes:

  • --enable-expert-parallel (line 32) is appropriate — MiniMax-M2.5 is a MoE model (456B total, ~47B active params). The MI355X version (minimaxm2.5_fp8_mi355x.sh) does NOT have this flag, so this is a deliberate addition for B200.
  • --trust-remote-code on both the vLLM serve and benchmark client side is correct for this model.
  • --block-size=32, --gpu-memory-utilization 0.95, and the overall script structure match existing patterns well.

No explicit FP8 quantization flags: The script doesn't pass --quantization fp8 or --kv-cache-dtype fp8. This is fine since MiniMax-M2.5 appears to ship with native FP8 weights, matching the approach in the MI355X version.


Config (nvidia-master.yaml)

The config entry at line 3018 looks correct:

  • runner: b200-nvs — matches the appropriate B200 runner
  • image: vllm/vllm-openai:v0.15.1-cu130 — new vLLM version, first config to use this tag
  • Search space (TP=2 and TP=4, conc 4-64) across all three seq-len configs is reasonable

No issues found here.
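For reference, the reviewed entry presumably looks something like this. A sketch only: the key, `runner`, and `image` values are quoted in the review; every other field name is an assumption about the repo's schema, and the exact concurrency sweep values are not shown.

```yaml
# Hypothetical shape — field names other than runner/image are assumed
minimaxm2.5-fp8-b200-vllm:
  runner: b200-nvs
  image: vllm/vllm-openai:v0.15.1-cu130
  model: MiniMaxAI/MiniMax-M2.5
  tp: [2, 4]
  concurrency: 4-64        # range as reviewed; exact sweep values not shown here
  seq_lens: [1k1k, 1k8k, 8k1k]
```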


Changelog (perf-changelog.yaml)

Content is correct. Minor: trailing whitespace on line 690 (pr-link: line) and trailing blank line at 692. Not a blocker.


Summary

One blocking issue: the benchmark_lib.sh source path bug will prevent the script from running. Everything else looks good — the --enable-expert-parallel addition and overall config are solid.

@functionstackx
Contributor Author

@claude implement this fix in this PR's runners/launch_b200-dgxc-slurm.sh too

[screenshot]

@Klaud-Cold
Collaborator

Klaud-Cold commented Feb 24, 2026

Claude finished @functionstackx's task in 1m 5s


Implementing fix in runners/launch_b200-dgxc-slurm.sh

  • Gather context and understand the fix from the screenshot
  • Implement the fix — change HF_HUB_CACHE_MOUNT path from /home/sa-shared/gharunners/hf-hub-cache/ to /scratch/fsw/gharunners/hf-hub-cache
  • Commit and push changes

Updated HF_HUB_CACHE_MOUNT path in runners/launch_b200-dgxc-slurm.sh:216 from /home/sa-shared/gharunners/hf-hub-cache/ to /scratch/fsw/gharunners/hf-hub-cache to match the correct mount point.


Change HF_HUB_CACHE_MOUNT from /home/sa-shared/gharunners/hf-hub-cache/
to /scratch/fsw/gharunners/hf-hub-cache

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
@cquil11 cquil11 requested a review from a team February 25, 2026 19:06
@cquil11
Collaborator

cquil11 commented Feb 25, 2026

@functionstackx
Contributor Author

@ankursingh-nv i cancelled the workflow as it seems to be hard erroring out

@ankursingh-nv ankursingh-nv force-pushed the claude/issue-756-20260219-0235 branch from 66a55e0 to b56773c Compare March 4, 2026 23:06
@ankursingh-nv ankursingh-nv requested a review from jgangani as a code owner March 4, 2026 23:06
--gpu-memory-utilization 0.95 \
--max-model-len $MAX_MODEL_LEN \
--block-size=32 \
--enable-expert-parallel \
Contributor Author

@ankursingh-nv can you set the EP setting in nvidia-master.yaml, and then in this script only pass --enable-expert-parallel if EP is set
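A minimal way to wire that up — a sketch, where the `EP_SIZE` variable name follows the env-var mentioned in the later PR #864 notes and is otherwise an assumption about how the runner exposes the config value:

```shell
#!/bin/bash
# Emit --enable-expert-parallel only when an EP size > 1 is configured.
# EP_SIZE is an assumed env-var name, populated from nvidia-master.yaml by the runner.
ep_flag() {
  if [ "${EP_SIZE:-1}" -gt 1 ]; then
    echo "--enable-expert-parallel"
  fi
}

EP_SIZE=2
echo "vllm serve ... $(ep_flag)"   # flag present
EP_SIZE=1
echo "vllm serve ... $(ep_flag)"   # flag omitted
```

This keeps the script usable for non-MoE sweeps from the same config file, since vLLM's flag is a boolean switch rather than an integer size.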

Contributor Author

@claude create a new PR from remote main that documents situations where vllm serve doesn't take --expert-parallel-size directly and uses --enable-expert-parallel instead, in CLAUDE.md, claude.yml, claude-review.yml

Collaborator

Klaud-Cold commented Mar 5, 2026

Claude finished @functionstackx's task in 2m 7s


Creating PR for --enable-expert-parallel documentation

  • Gather context: read CLAUDE.md, claude.yml, claude-pr-review.yml
  • Create new branch from remote main
  • Add documentation about --enable-expert-parallel vs --expert-parallel-size usage
  • Commit and push changes
  • Create PR

Created PR #864: Document expert parallelism flag differences in Claude configs

Changes made to claude.yml and claude-pr-review.yml:

  • vLLM uses --enable-expert-parallel (boolean flag) — does NOT accept --expert-parallel-size
  • SGLang uses --expert-parallel-size N (explicit integer)
  • ATOM (AMD vLLM fork) uses --enable-expert-parallel (same as vLLM)
  • Scripts must conditionally enable --enable-expert-parallel based on EP_SIZE env var, not hardcode it
  • Added PR review validation rule to flag hardcoded --enable-expert-parallel

Note: There is no CLAUDE.md file in this repo — the Claude instructions live in the workflow prompt: fields in claude.yml and claude-pr-review.yml.

Contributor Author

@ankursingh-nv @kedarpotdar-nv can you also link the vLLM recipes for MiniMax and Kimi K2.5 in the official NVIDIA docs too

https://docs.nvidia.com/llm-inference-quick-start-recipes/index.html

Collaborator

Sure, let me address the EP setting.

@functionstackx
Contributor Author

@claude update PR description, seems like the image is now 0.16 and not 0.15.1

@Klaud-Cold
Collaborator

Klaud-Cold commented Mar 5, 2026

Claude finished @functionstackx's task in 32s


Updating PR Description

  • Check current image version in config
  • Update PR description

Updated the PR description to reflect the correct image version: vllm/vllm-openai:v0.16.0-cu130 (was v0.15.1-cu130).


Comment on lines +26 to +27
export VLLM_USE_FLASHINFER_MOE_FP8=0
export VLLM_MOE_USE_DEEP_GEMM=0
Contributor Author

document this too please in the vllm recipes

@functionstackx
Contributor Author

@ankursingh-nv is this PR ready? if so, can you ping @cquil11 for review



Development

Successfully merging this pull request may close these issues.

b200 MiniMax M2.5 FP8

5 participants